Improving RNA secondary structure prediction using direct coupling analysis
He Xiaoling, Wang Jun, Wang Jian, Xiao Yi
School of Physics, Huazhong University of Science and Technology, Wuhan 430074, China

 

† Corresponding author. E-mail: yxiao@hust.edu.cn

Project supported by the National Natural Science Foundation of China (Grant No. 31570722).

Abstract

Secondary structures of RNAs are the basis of understanding their tertiary structures and functions and so their predictions are widely needed due to increasing discovery of noncoding RNAs. In the last decades, a lot of methods have been proposed to predict RNA secondary structures but their accuracies encountered bottleneck. Here we present a method for RNA secondary structure prediction using direct coupling analysis and a remove-and-expand algorithm that shows better performance than four existing popular multiple-sequence methods. We further show that the results can also be used to improve the prediction accuracy of the single-sequence methods.

1. Introduction

Secondary structures of RNAs are widely used in the studies of their tertiary structures and functions and become more and more important due to the discovery of increasing number of noncoding RNAs in biological processes.[16] Many methods of RNA secondary structure prediction have been proposed in the last decades.[7,8] These methods can be divided into two categories: one is based on single sequence,[9,10] and the other is on multiple sequences.[7] For the single-sequence approach, the most widely used methods are based on Nissinov algorithm[11] and minimum free energy (MFE) model.[10,1214] The accuracies of these methods are about 70% in both precision (PPV) and sensitivity (STY).[15,16] For the multiple-sequence approach, structure conservation between homologous sequences from the same RNA family is utilized to infer their common secondary structure.[7] The PPVs of these methods are usually higher than those based on free-energy minimization, about 70%–80%, but the STYs are lower and usually less than 70%.[7,17] Therefore, accurate prediction of RNA secondary structure remains a challenge.

The secondary structure of an RNA is a specific base-pairing pattern formed by the canonical and wobble base pairs (A-U, G-C, and G-U) in its native structure. For convenience, we call these three types of base pairs as the standard base pairs in the following. Pseudoknots are not considered in this work and will be considered as tertiary interactions. The challenge of secondary structure prediction is to find this specific base-pairing pattern from a huge number of possible ones. The MFE model assumes that this specific base-pairing pattern is a minimum free energy state but in practice it is difficult to obtain accurate free energy of a base-pairing pattern. However, it was shown that the accuracy of the single-sequence MFE method could be increased by experimental constraints, like SHAPE data.[18,19] On the other hand, for the multiple-sequence approach, the base-pairing pattern might be determined from coevolutionary base pairs (co-pairs for short) inferred from the homologous sequences of the target RNA.[20,21] At present, there are many methods for inferring the co-pairs, like direct coupling analysis (DCA).[2224] If using the co-pairs inferred by these methods to directly predict RNA secondary structures, the performance is only comparable to the existing methods. However, it was shown that combining with the generalized Nussinov algorithm, DCA could predict RNA secondary structures with significant improvement in STY without reducing PPV in comparison with the mutual-information (MI) method according to the test on six RNAs.[20]

Here we present a method that combines DCA with a remove-and-expand algorithm to predict RNA secondary structure and benchmark it on a large test set. We call our method as 2dRNAdca, in consistent with our prediction method of three-dimensional structure of RNA, 3dRNA.[21,2527] Furthermore, we show that the results of 2dRNAdca can be used to improve the prediction accuracy of the single-sequence MFE method.

2. Methods
2.1. Direct coupling analysis

The 2dRNAdca is based on the co-pairs inferred from the homologous sequences of the target RNA by DCA.[2224] There are other models to infer the co-pairs, e.g., using MI of a multiple sequence alignment.[28] The shortage of MI is that the predicted co-pairs contain many pairwise residues that are not in direct contacts in tertiary structure.[24,29] DCA was proposed to disentangle direct contacts from indirect ones.[24]

The basic principle of DCA is briefly presented according to the detailed description in our previously published papers.[21,30] For a target RNA sequence, we can do multiple sequence alignment (MSA) with its homologous sequences from the same RNA family. The MSA can be represented as , where the residue with i = 1,…,5 can be the nucleotides A, C, G, U or gap “–” and M is the number of the homologous sequences including the target RNA sequence A1AL. DCA assumes that the occurrence probability P(A1,…,AL) of the target RNA sequence within MSA satisfies the maximum-entropy model with the constraints that single and pair probabilities Pi(Ai) and Pij(Ai,Aj) being determined by

This leads to a Potts model

where eij(Ai,Aj) and hi(Ai) are related to the pairwise and single energies of the nucleotides, and Z is the partition function in the form

In the Potts model, eij(Ai,Aj) is just the direct interaction or direct coupling what we want to calculate and so to infer it is a problem of maximum likelihood. Since the terms in the partition function increase exponentially with the sequence length, different approximations have been proposed in practice to deal with this problem. In this work, DCA under mean field approximation (mfDCA) is used. In this case, the partition function Z is expanded as Taylor’s series around zero coupling and it is found when keeping the linear term, the direct interaction between nucleotides can be estimated by the inverse of the covariance matrix

where Cij(Ai,Aj) = fij(Ai,Aj) – fi(Ai)fj(Aj) is the covariance matrix and fi(Ai) and fij(Ai,Aj) are observed single and pair probabilities given the MSA.

DCA calculates a score for each residue pair in the target RNA sequence. Usually, the pairs of the N largest scores are considered as co-pairs, e.g., with N equal to or less than the length of the target RNA. The DCA scores for the target RNA can be input by users or can be calculated by using the option in our 2dRNAdca web server. Multiple sequence alignment for a target RNA sequence is required for prediction of co-pairs using DCA and is generated from Rfam database[31,32] in this work. There are other DCA algorithms, e.g., using a global probability model through pseudo-likelihood maximization (plmDCA)[33,34] to infer the co-pairs.

2.2. Remove-and-expand algorithms

The co-pairs inferred by DCA usually include many non-standard base pairs, one-to-many base pairs, pseudoknotted base pairs, and single base pair (helix with only one base pair). Since non-standard base pairs and pseudoknotted base pairs are not considered in this work and one-to-many base pairs and single base pair rarely occur in a RNA secondary structure,[35] they can be considered as false positives. 2dRNAdca adopts a “remove-and-expand” algorithm to remove these false positives among the co-pairs inferred by DCA and it includes two steps:

Removing step. Take the top N standard (A-U, G-C, and G-U) co-pairs from the co-pairs inferred by DCA and remove the single base pair and one-to-many base pairs in them. However, if a single base pair can be expanded (see the following expanding step), it will be retained. Similarly, if some base pairs in one-to-many base pairs can be expanded, the one that can form the longest stem will be retained.

Expanding step. The remaining co-pairs after the removing step are considered as predicted base pairs of the target RNA and can be taken as the “cores” of the native secondary structure. Expand or maximize the standard base pairs from the predicted unpaired bases if they can form a continuous stacking with the predicted base pairs. During the expanding process, the hairpin loops are kept to have at least three bases. During expanding, if one base can form one more base pairs with different bases, the algorithm chooses the one that can form the longest stem. After the expanding process, the pseudoknotted base pairs are removed. The final structure is considered as the predicted secondary structure of the target sequence.

In Fig. 1, we give an example to show how the remove-and-expand algorithm combines with DCA in 2dRNAdca. The RNA in this example is an RNA (PDB ID: 2KDQ_B) with a length L of 29 nucleotides. Figure 1(a) shows the top 0.2L standard co-pairs (red and green pairs) inferred by mfDCA. In them there is a single base pair and two one-to-two base pairs. Figure 1(b) shows the result after the removing step. Since the single base pair and one base pair in each of the two one-to-two base pairs can be expanded, three co-pairs remain. Then we perform the expanding step and in this process seven standard base pairs are added (Fig. 1(c)). Eventually ten base pairs are predicted and they all are native base pairs. These results show that the remove-and-expand algorithm can significantly improve the prediction accuracy of mfDCA.

Fig. 1. Working flow of the remove-and-expand algorithm in 2dRNAdca for an RNA (PDB ID: 2KDQ_B). The base pair connected by red lines are the native ones. (a) The bases in the top 0.2L standard co-pairs inferred by mfDCA are in red and green, where L is the length of the RNA. In these co-pairs, there are a single base pair (in red) and two one-to-two base pairs that each includes a base pair connected by a red line and a base pair connected by a green line (non-native base pair). The single base pair and the base pairs connected by red lines in the two one-to-two base pairs can be expanded and are retained after the removing step. (b) The result after the removing step. (c) The expanded result of (b) where the expanded base pairs are in light blue.
2.3. Datasets

Two test sets are used in this work. The test set I consists of six RNAs (PDB IDs: 1Y26, 2GDI, 2GIS, 3IRW, 3VRS, 3OWI) (Table 1) that were used in a previous work.[20] The test set II originally consists of 107 RNAs from PDB that were used to benchmark various methods of RNA secondary structure prediction.[7] The sequences in the testing set whose similarity are more than 85% with sequences in the training dataset and those that form base pairs with other sequences are removed. Finally, the testing set contains 94 RNAs with sequence lengths 29–377 nucleotides and the numbers of their homologous sequences are from 6 to 1023 (Table 2). This test set and the benchmark results of various methods on it are available for download from the PDB database in the data sets section on the website: https://iimcb.genesilico.pl/comparna/.

Table 1.

Performance of mfDCA and 2dRNAdca on the test set I.

.
Table 2.

Parameters of the test set II of 94 RNAs.

.
2.4. Evaluation of performance

We use precision (PPV), sensitivity (STY), and Matthews correlation coefficients (MCC)[36,37] to measure the performance of the methods to predict RNA secondary structures as usual.[7,17] PPV measures the fraction of predictions that are native base pairs, STY measures the fraction of native base pairs that are predicted out, and MCC is a balanced measure of PPV and STY. They are defined as follows:

where TP denotes true positive; FP, false positive; TN, true negative; FN, false negative. The native base pairs are determined by using the same tool RNAView[38] as in CompaRNA[7] because we used the benchmark results of CompaRNA for CentroidAlifold, MXSCARNA, RNAalifold, and TurboFold in comparison with mfDCA and 2dRNAdca, although there are other tools annotating the native base pairs, e.g., DSSR[39] and RNApdbee.[40] Since the current version of our method does not consider pseudoknots, the pseudoknotted base pairs in the predicted and native structures are not taken into account in evaluation.

3. Results

Table 1 and figure 2 show the performance of 2dRNAdca over the test set I of six riboswitch RNAs that was used in a previous work.[20] It is clear that 2dRNAdca can further increase the performance of mfDCA in comparison with the method that combined DCA with a generalized Nussinov algorithm in both PPV and STY.[20] The mean PPV, STY, and MCC of the former all are 0.91 and are clearly higher than the latter. The top 0.3L standard co-pairs inferred by mfDCA are used for 2dRNAdca and mfDCA respectively (see the following), where L is the length of the target RNA.

Fig. 2. The performances of 2dRNAdca, DCA*, and mfDCA for six RNAs. DCA* is the method used in Ref. [20] that combined mfDCA with a generalized Nussinov algorithm. The mean values of PPV, STY, and MCC of DCA* for the six RNAs were calculated according to Fig. 3 in Ref. [20].

The 2dRNAdca is further benchmarked on the test set II (Table 2) that was used previously to benchmark various methods of RNA secondary structure predictions,[7] and is compared with the four popular multiple-sequence methods, CentroidAlifold,[41] MXSCARNA,[42] RNAalifold,[13] and TurboFold.[43] These four methods are selected because they are the best multiple-sequences methods according to the ranking on the test set II.[7]

We first study how the performances of mfDCA and 2dRNAdca change with the top N standard co-pairs (Fig. 3). We find that the performance (MCC) is the best when N is about 0.3L for both mfDCA and 2dRNAdca. In this case, the mean values of PPV, STY, and MCC are 0.59, 0.65, and 0.61 for mfDCA and 0.85, 0.80, and 0.81 for 2dRNAdca (Table 3). Therefore, 2dRNAdca performs much better than mfDCA.

Fig. 3. The performance of (a) mfDCA and (b) 2dRNAdca vs. the top N standard co-pairs with N = 0.1L, 0.2L, II, L over the test set II. L is the length of RNA.
Table 3.

Performance of different methods of RNA secondary structure prediction over the test set II.

.

For comparison, Table 2 also gives the performances of CentroidAlifod, MXScarma, RNAalifold, and TurboFold on the same test set. The results indicate that 2dRNAdca performs better than those methods in both PPV and STY and so MCC (Fig. 4). The STYs of CentroidAlifod and RNAalifold are very low in comparison with that of 2dRNAdca. It is worthy to point out that since RNAalifold and Turbofold predict nonzero number of standard base pairs only for a part of RNAs (59 and 35 out of 94) in the test set II, it is inappropriate to compare them with other multiple-sequences methods in Table 2 in fact. Furthermore, it is noted that TurboFold has been updated to TurboFold II but the performance of the latter in structure prediction is similar to the former[17] and so Table 3 only lists the benchmark result of TurboFold.

Fig. 4. The performances of 2dRNAdca, CentroidAlifod, MXScarma, RNAalifold, Turbofold, and mfDCA on the test set II.

Table 4 shows the performances of 2dRNAdca for different types of RNAs in the test set II. Among the six types of RNAs, 2dRNAdca has higher performance (MCC) for tRNA, riboswitch, and rRNA and has lower performance (MCC) for the other three types.

Table 4.

Performance of 2dRNA for different types of RNAs in the test set II.

.

Finally, we show that the results of 2dRNAdca can be used to improve the accuracy of single-sequence MFE methods. Here, we use the predicted base pairs of 2dRNAdca as the constraints for the MFE methods, including Mfold,[10] RNAfold,[13,44] and RNAstructure.[12] The predictions can be done by using the default parameters on the 2dRNAdca web server and the results (2dRNAdca + Mfold, 2dRNAdca + RNAfold, 2dRNAdca + RNAstructure) are also presented in Table 3. It shows that the accuracies of the MFE methods can be increased by at least 10%.

4. Discussion

It is known that the performance of multiple-sequences methods usually depends on the number of available homologous sequences. However, figure 5 shows that 2dRNAdca can greatly improve the performance of mfDCA even in the case with only a small number of homologous sequences, e.g., for an RNA with six homologous sequences (PDB ID: 2KUR_A), the PPV, STY, and MCC of mfDCA are 0.29, 0.21, and 0.23 while those of 2dRNA are 1.00, 0.84, and 0.92 both in top 0.3L co-pairs.

Fig. 5. The scatter plot of performances of (a) 2dRNAdca and (b) mfDCA with the number of homologous sequences in the test set II.

Figure 6 also shows the performance of 2dRNAdca and mfDCA with the sequence length of RNA for the test set II. We find that their performances have no clear dependence on the sequence length.

Fig. 6. The scatter plot of performances of (a) 2dRNAdca and (b) mfDCA with the sequence length for the test set II.

The performance of 2dRNAdca depends not only on the accuracy of mfDCA in inference of co-pairs, i.e., the number of true positives (native base pairs) in the top 0.3L co-pairs, but also whether these true positives distribute to all the stems (Fig. 7). For example, for 1KXK the true positives in the top 0.3L standard co-pairs do not distribute to some stems (see Fig. 7) and so the expanding step does not work for these stems, leading to a lower performance of 2dRNAdca. On the other hand, if the true positives in the top standard co-pairs distribute to all the stems, 2dRNAdca can give higher performance than mfDCA even the STY of the latter is very low.

Fig. 7. Some examples that the true positives in the top 0.3L standard co-pairs inferred by mfDCA do not distribute to some stems.
5. Conclusion

In summary, we propose a method, 2dRNAdca, for predicting RNA secondary structure by combining DCA with a remove-and-expand algorithm. The benchmark results show that 2dRNAdca can significantly increase the performance of mfDCA and it also performs better than existing popular multiple-sequences methods. Furthermore, we show that the results of 2dRNAdca can be used to improve the prediction accuracy of the single-sequence method. It is expected that the performance of 2dRNAdca will improve as the accuracy of DCA improves.

Reference
[1] Li X Bu D Sun L Wu Y Fang S Li H Luo H Luo C Fang W Chen R Zhao Y 2017 Curr. Protoc Bioinf. 58 12.16.1
[2] Zhao Y Li H Fang S Kang Y Wu W Hao Y Li Z Bu D Sun N Zhang M Q Chen R 2016 Nucleic Acids Res. 44 D203
[3] Yang Y Gu Q Zhang B Shi Y Shao Z 2018 Chin. Phys. 27 38701
[4] Shi Y Wu Y Wang F Tan Z 2014 Chin. Phys. 23 78701
[5] Bao L Zhang X Jin L Tan Z 2016 Chin. Phys. 25 18703
[6] Chang X Xu L Shi H 2015 Chin. Phys. 24 128703
[7] Puton T Kozlowski L P Rother K M Bujnicki J M 2013 Nucleic Acids Res. 41 4307
[8] Mathews D H Turner D H Watson R M 2016 Curr. Protoc Nucleic Acid Chem. 67 11.2.1
[9] Zuker M Stiegler P 1981 Nucleic Acids Res. 9 133
[10] Zuker M 2003 Nucleic Acids Res. 31 3406
[11] Nussinov R Pieczenik G Griggs J R Kleitman D J 1978 Siam J. Appl. Math. 35 68
[12] Bellaousov S Reuter J S Seetin M G Mathews D H 2013 Nucleic Acids Res. 41 W471
[13] Lorenz R Bernhart S H Siederdissen C H Z Tafer H Flamm C Stadler P F Hofacker I L 2011 Algorithm Mol. Biol. 6 26
[14] Janssen S Giegerich R 2015 Bioinformatics 31 423
[15] Doshi K J Cannone J J Cobaugh C W Gutell R R 2004 Bmc Bioinf. 5 105
[16] Zhao Y Wang J Zeng C Xiao Y 2018 Biophys. Rep. 4 123
[17] Tan Z Fu Y H Sharma G Mathews D H 2017 Nucleic Acids Res. 45 11570
[18] Deigan K E Li T W Mathews D H Weeks K M 2009 Proc. Natl. Acad. Sci. United States Am. 106 97
[19] Sukosd Z Swenson M S Kjems J Heitsch C E 2013 Nucleic Acids Res. 41 2807
[20] De Leonardis E Lutz B Ratz S Cocco S Monasson R Schug A Weigt M 2015 Nucleic Acids Res. 43 10444
[21] Wang J Mao K K Zhao Y J Zeng C Xiang J J Zhang Y Xiao Y 2017 Nucleic Acids Res. 45 6299
[22] Morcos F Pagnani A Lunt B Bertolino A Marks D S Sander C Zecchina R Onuchic J N Hwa T Weigt M 2011 Proc. Natl. Acad. Sci. USA 108 E1293
[23] de Juan D Pazos F Valencia A 2013 Nat Rev. Genet 14 249
[24] Morcos F Hwa T Onuchic J N Weigt M 2014 Methods Molecular Biology 1137 55
[25] Zhao Y Huang Y Gong Z Wang Y Man J Xiao Y 2012 Sci. Reports 2 734
[26] Wang J Zhao Y Zhu C Xiao Y 2015 Nucleic Acids Res. 43 e63
[27] Wang J Xiao Y 2017 Curr. Protoc Bioinf. 57 5
[28] Gouveia-Oliveira R Pedersen A G 2007 Algorithms Mol. Biol. 2 12
[29] Fodor A A Aldrich R W 2004 Proteins 56 211
[30] He X Li S Ou X Wang J Xiao Y 2019 Commun. Inf. Syst. 19 279
[31] Nawrocki E P Burge S W Bateman A Daub J Eberhardt R Y Eddy S R Floden E W Gardner P P Jones T A Tate J Finn R D 2015 Nucleic Acids Res. 43 D130
[32] Griffiths-Jones S Bateman A Marshall M Khanna A Eddy S R 2003 Nucleic Acids Res. 31 439
[33] Ekeberg M Hartonen T Aurell E 2014 J. Comput. Phys. 276 341
[34] Ekeberg M Lovkvist C Lan Y H Weigt M Aurell E 2013 Phys. Rev. 87 012707
[35] Danaee P Rouches M Wiley M Deng D Huang L Hendrix D 2018 Nucleic Acids Res. 46 5381
[36] Matthews B W 1975 Biochim. Biophysica Acta 405 442
[37] Parisien M Cruz J A Westhof E Major F 2009 RNA 15 1875
[38] Yang H Jossinet F Leontis N Chen L Westbrook J Berman H Westhof E 2003 Nucleic Acids Res. 31 3450
[39] Lu X J Bussemaker H J Olson W K 2015 Nucleic Acids Res. 43 e142
[40] Zok T Antczak M Zurkowski M Popenda M Blazewicz J Adamiak R W Szachniuk M 2018 Nucleic Acids Res. 46 W30
[41] Hamada M Sato K Asai K 2011 Nucleic Acids Res. 39 393
[42] Tabei Y Kiryu H Kin T Asai K 2008 Bmc Bioinf. 9 33
[43] Harmanci A O Sharma G Mathews D H 2011 Bmc Bioinf. 12 108
[44] Gruber A R Bernhart S H Lorenz R 2015 Methods Molecular Biology 1269 307